Incremental All Pairs Similarity Search for Varying Similarity Thresholds with Reduced I/O Overhead

Authors

  • Amit C. Awekar
  • Nagiza F. Samatova
  • Paul Breimyer
Abstract

All Pairs Similarity Search (APSS) is the problem of finding all pairs of records with similarity scores above a specified threshold. Incremental All Pairs Similarity Search (IAPSS) is the problem of performing APSS multiple times over the same dataset while varying the similarity threshold. This problem is ubiquitous in real-world systems such as search engines, online social networks, and digital libraries. A significant part of the computation is redundant across multiple invocations of APSS. Our solution to the IAPSS problem avoids these redundant computations by storing the history of previous APSS invocations and by splitting the inverted index, which maps each dimension to a list of records that have non-zero projections along that dimension. The size of the computation history increases quadratically with the number of records in the dataset. We introduce the concept of a similarity floor to store only a partial computation history, resulting in reduced I/O overhead. We empirically evaluate the effectiveness of our techniques using four real-world large-scale datasets. Our IAPSS solution achieves speed-ups on the order of 2X to over 10^5X over the state-of-the-art APSS algorithm, while reducing the size of the computation history by at least an order of magnitude.
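
To make the ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes cosine similarity over unit-normalized sparse vectors, and it reads the similarity floor as a lower bound below which pair scores are simply not stored; both are assumptions, as the abstract does not spell them out. The names `normalize`, `build_inverted_index`, `pair_scores`, and `IncrementalAPSS` are hypothetical, and the index-splitting optimization is not shown.

```python
from collections import defaultdict
from math import sqrt


def normalize(vec):
    """Scale a sparse vector {dimension: weight} to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {dim: w / norm for dim, w in vec.items()}


def build_inverted_index(records):
    """Map each dimension to the (record id, weight) pairs with a
    non-zero projection along that dimension."""
    index = defaultdict(list)
    for rid, vec in enumerate(records):
        for dim, w in vec.items():
            if w != 0:
                index[dim].append((rid, w))
    return index


def pair_scores(records):
    """Accumulate dot products for all pairs sharing a dimension.
    Pairs with no shared dimension have score 0 and are never touched."""
    scores = defaultdict(float)
    for postings in build_inverted_index(records).values():
        for i, (a, wa) in enumerate(postings):
            for b, wb in postings[i + 1:]:
                scores[(a, b)] += wa * wb  # a < b by construction
    return scores


class IncrementalAPSS:
    """Hypothetical wrapper: compute scores once, keep only pairs at or
    above the similarity floor, and answer later thresholds from that
    stored history instead of recomputing."""

    def __init__(self, records, floor):
        self.floor = floor
        self.history = {
            pair: s for pair, s in pair_scores(records).items() if s >= floor
        }

    def query(self, threshold):
        if threshold < self.floor:
            # Scores below the floor were discarded, so the stored
            # history cannot answer this query.
            raise ValueError("threshold is below the similarity floor")
        return {pair for pair, s in self.history.items() if s >= threshold}
```

With this sketch, one `IncrementalAPSS` object can serve any number of `query` calls at or above its floor; raising the floor shrinks the stored history at the cost of not being able to answer lower thresholds without recomputation.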

Similar resources

Optimal Dimension Order: A Generic Technique for the Similarity Join

The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs where the distance does not exceed a given parameter ε. Although the similarity join is clearly CP...

A Partitioned Similarity Search with Cache-Conscious Data Traversal

All pairs similarity search (APSS) is used in many web search and data mining applications. Previous work has used techniques such as comparison filtering, inverted indexing, and parallel accumulation of partial results. However, shuffling intermediate results can incur significant communication overhead as data scales up. This paper studies a scalable two-phase approach called Partition-based ...

Scaling up top-K cosine similarity search

Article history: received 21 September 2009; received in revised form 23 August 2010; accepted 23 August 2010; available online 8 September 2010. Recent years have witnessed an increased interest in computing cosine similarity in many application domains. Most previous studies require the specification of a minimum similarity threshold to perform the cosine similarity computation. However, it is us...

Scaling up all pairs similarity search

Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity. Scaling up all pairs similarity search, published by ACM. The problem of finding a...

Massively-Parallel Similarity Join, Edge-Isoperimetry, and Distance Correlations on the Hypercube

We study distributed protocols for finding all pairs of similar vectors in a large dataset. Our results pertain to a variety of discrete metrics, and we give concrete instantiations for Hamming distance. In particular, we give improved upper bounds on the overhead required for similarity defined by Hamming distance r > 1 and prove a lower bound showing qualitative optimality of the overhead req...

Publication date: 2009